Common Data Mining Techniques




桂松涛 Blog
songtaogui@163.com



Basic Data Types for Data Mining


Matrix?     List?     Dictionary?     DataFrame!

    

SepalLength  SepalWidth  PetalLength  PetalWidth  Species
5.1          3.5         1.4          0.2         setosa
4.9          3.0         1.4          0.2         setosa
4.7          3.2         1.3          0.2         setosa
4.6          3.1         1.5          0.2         setosa
5.0          3.6         1.4          0.2         setosa
5.4          3.9         1.7          0.4         setosa
4.6          3.4         1.4          0.3         setosa
5.0          3.4         1.5          0.2         setosa
4.4          2.9         1.4          0.2         setosa
4.9          3.1         1.5          0.1         setosa
5.4          3.7         1.5          0.2         setosa
4.8          3.4         1.6          0.2         setosa
4.8          3.0         1.4          0.1         setosa
4.3          3.0         1.1          0.1         setosa
5.8          4.0         1.2          0.2         setosa
5.7          4.4         1.5          0.4         setosa
5.4          3.9         1.3          0.4         setosa
5.1          3.5         1.4          0.3         setosa
5.7          3.8         1.7          0.3         setosa
5.1          3.8         1.5          0.3         setosa
5.4          3.4         1.7          0.2         setosa
5.1          3.7         1.5          0.4         setosa
4.6          3.6         1.0          0.2         setosa
5.1          3.3         1.7          0.5         setosa
4.8          3.4         1.9          0.2         setosa
5.0          3.0         1.6          0.2         setosa
5.0          3.4         1.6          0.4         setosa
5.2          3.5         1.5          0.2         setosa
5.2          3.4         1.4          0.2         setosa
4.7          3.2         1.6          0.2         setosa
4.8          3.1         1.6          0.2         setosa
5.4          3.4         1.5          0.4         setosa
5.2          4.1         1.5          0.1         setosa
5.5          4.2         1.4          0.2         setosa
4.9          3.1         1.5          0.2         setosa
5.0          3.2         1.2          0.2         setosa
5.5          3.5         1.3          0.2         setosa
4.9          3.6         1.4          0.1         setosa
4.4          3.0         1.3          0.2         setosa
5.1          3.4         1.5          0.2         setosa
5.0          3.5         1.3          0.3         setosa
4.5          2.3         1.3          0.3         setosa
4.4          3.2         1.3          0.2         setosa
5.0          3.5         1.6          0.6         setosa
5.1          3.8         1.9          0.4         setosa
4.8          3.0         1.4          0.3         setosa
5.1          3.8         1.6          0.2         setosa
4.6          3.2         1.4          0.2         setosa
5.3          3.7         1.5          0.2         setosa
5.0          3.3         1.4          0.2         setosa
7.0          3.2         4.7          1.4         versicolor
6.4          3.2         4.5          1.5         versicolor
6.9          3.1         4.9          1.5         versicolor
5.5          2.3         4.0          1.3         versicolor
6.5          2.8         4.6          1.5         versicolor
5.7          2.8         4.5          1.3         versicolor
6.3          3.3         4.7          1.6         versicolor
4.9          2.4         3.3          1.0         versicolor
6.6          2.9         4.6          1.3         versicolor
5.2          2.7         3.9          1.4         versicolor
5.0          2.0         3.5          1.0         versicolor
5.9          3.0         4.2          1.5         versicolor
6.0          2.2         4.0          1.0         versicolor
6.1          2.9         4.7          1.4         versicolor
5.6          2.9         3.6          1.3         versicolor
6.7          3.1         4.4          1.4         versicolor
5.6          3.0         4.5          1.5         versicolor
5.8          2.7         4.1          1.0         versicolor
6.2          2.2         4.5          1.5         versicolor
5.6          2.5         3.9          1.1         versicolor
5.9          3.2         4.8          1.8         versicolor
6.1          2.8         4.0          1.3         versicolor
6.3          2.5         4.9          1.5         versicolor
6.1          2.8         4.7          1.2         versicolor
6.4          2.9         4.3          1.3         versicolor
6.6          3.0         4.4          1.4         versicolor
6.8          2.8         4.8          1.4         versicolor
6.7          3.0         5.0          1.7         versicolor
6.0          2.9         4.5          1.5         versicolor
5.7          2.6         3.5          1.0         versicolor
5.5          2.4         3.8          1.1         versicolor
5.5          2.4         3.7          1.0         versicolor
5.8          2.7         3.9          1.2         versicolor
6.0          2.7         5.1          1.6         versicolor
5.4          3.0         4.5          1.5         versicolor
6.0          3.4         4.5          1.6         versicolor
6.7          3.1         4.7          1.5         versicolor
6.3          2.3         4.4          1.3         versicolor
5.6          3.0         4.1          1.3         versicolor
5.5          2.5         4.0          1.3         versicolor
5.5          2.6         4.4          1.2         versicolor
6.1          3.0         4.6          1.4         versicolor
5.8          2.6         4.0          1.2         versicolor
5.0          2.3         3.3          1.0         versicolor
5.6          2.7         4.2          1.3         versicolor
5.7          3.0         4.2          1.2         versicolor
5.7          2.9         4.2          1.3         versicolor
6.2          2.9         4.3          1.3         versicolor
5.1          2.5         3.0          1.1         versicolor
5.7          2.8         4.1          1.3         versicolor
6.3          3.3         6.0          2.5         virginica
5.8          2.7         5.1          1.9         virginica
7.1          3.0         5.9          2.1         virginica
6.3          2.9         5.6          1.8         virginica
6.5          3.0         5.8          2.2         virginica
7.6          3.0         6.6          2.1         virginica
4.9          2.5         4.5          1.7         virginica
7.3          2.9         6.3          1.8         virginica
6.7          2.5         5.8          1.8         virginica
7.2          3.6         6.1          2.5         virginica
6.5          3.2         5.1          2.0         virginica
6.4          2.7         5.3          1.9         virginica
6.8          3.0         5.5          2.1         virginica
5.7          2.5         5.0          2.0         virginica
5.8          2.8         5.1          2.4         virginica
6.4          3.2         5.3          2.3         virginica
6.5          3.0         5.5          1.8         virginica
7.7          3.8         6.7          2.2         virginica
7.7          2.6         6.9          2.3         virginica
6.0          2.2         5.0          1.5         virginica
6.9          3.2         5.7          2.3         virginica
5.6          2.8         4.9          2.0         virginica
7.7          2.8         6.7          2.0         virginica
6.3          2.7         4.9          1.8         virginica
6.7          3.3         5.7          2.1         virginica
7.2          3.2         6.0          1.8         virginica
6.2          2.8         4.8          1.8         virginica
6.1          3.0         4.9          1.8         virginica
6.4          2.8         5.6          2.1         virginica
7.2          3.0         5.8          1.6         virginica
7.4          2.8         6.1          1.9         virginica
7.9          3.8         6.4          2.0         virginica
6.4          2.8         5.6          2.2         virginica
6.3          2.8         5.1          1.5         virginica
6.1          2.6         5.6          1.4         virginica
7.7          3.0         6.1          2.3         virginica
6.3          3.4         5.6          2.4         virginica
6.4          3.1         5.5          1.8         virginica
6.0          3.0         4.8          1.8         virginica
6.9          3.1         5.4          2.1         virginica
6.7          3.1         5.6          2.4         virginica
6.9          3.1         5.1          2.3         virginica
5.8          2.7         5.1          1.9         virginica
6.8          3.2         5.9          2.3         virginica
6.7          3.3         5.7          2.5         virginica
6.7          3.0         5.2          2.3         virginica
6.3          2.5         5.0          1.9         virginica
6.5          3.0         5.2          2.0         virginica
6.2          3.4         5.4          2.3         virginica
5.9          3.0         5.1          1.8         virginica
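
One way to see why the DataFrame wins over the other candidates: a table like the one above mixes numeric measurements with a text label. The sketch below is only an illustration, assuming DataFrames.jl is installed; the names `m` and `df` are made up for this example, and the two rows are copied from the table.

```julia
using DataFrames

# A Matrix has a single element type, so mixing numbers and labels
# forces every cell to the catch-all type Any.
m = [5.1 3.5 "setosa"; 7.0 3.2 "versicolor"]
eltype(m)               # Any

# A DataFrame keeps one concrete type per named column.
df = DataFrame(
    SepalLength = [5.1, 7.0],
    SepalWidth  = [3.5, 3.2],
    Species     = ["setosa", "versicolor"],
)
eltype(df.SepalLength)  # Float64
eltype(df.Species)      # String
```

Columns are addressed by name rather than by position, which is what the grouping and summarizing steps later in the deck rely on.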



Basic Data Types for Data Mining


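# DataFrame overview
# Assumption: iris is loaded from RDatasets.jl, whose column names and types match the output below
julia> using DataFrames, RDatasets
julia> iris = dataset("datasets", "iris");
julia> iris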
150×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species   
     │ Float64      Float64     Float64      Float64     Cat…
─────┼─────────────────────────────────────────────────────────────
   1 │  5.1         3.5          1.4         0.2         setosa
   2 │  4.9         3.0          1.4         0.2         setosa
  ⋮ │  ⋮           ⋮           ⋮          ⋮           ⋮  
 149 │  6.2         3.4          5.4         2.3         virginica
 150 │  5.9         3.0          5.1         1.8         virginica

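# Basic DataFrame operations

iris[1, 2]                               # df[row, col]: read one cell
iris[1, 2] = 0                           # assign to one cell (stored as 0.0 in the Float64 column)
iris.TEST = fill('A', nrow(iris))        # add a new column
iris[1:10, ["SepalLength", "Species"]]   # rows 1 to 10 of two columns
iris.Species                             # df.ColName: the column itself, not a copy
iris."Species"                           # same, with the name given as a string
iris[:, "Species"]                       # a copy of the column
names(iris)                              # column names
size(iris)                               # (rows, columns)
nrow(iris)                               # number of rows
ncol(iris)                               # number of columns
iris[iris.SepalLength .> 4, :]           # rows where SepalLength > 4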
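
The three column accessors above look interchangeable, but they differ in whether they copy. The sketch below is a small illustration of that difference, assuming DataFrames.jl; it uses a hypothetical three-row `toy` table (values copied from the iris table) so it runs without the full dataset.

```julia
using DataFrames

# Hypothetical toy table: three rows taken from the iris table.
toy = DataFrame(
    SepalLength = [5.1, 4.9, 7.0],
    Species     = ["setosa", "setosa", "versicolor"],
)

col_ref  = toy.SepalLength        # the vector stored inside toy (no copy)
col_copy = toy[:, "SepalLength"]  # a freshly allocated copy

col_copy[1] = 0.0   # changes only the copy
col_ref[2]  = 0.0   # writes through to toy

toy[1, "SepalLength"]  # unchanged
toy[2, "SepalLength"]  # now 0.0

# Filtering with a Boolean mask returns a new DataFrame.
long = toy[toy.SepalLength .> 4, :]
```

When a column is going to be modified in place, copying it first with `df[:, col]` avoids changing the original table by accident.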


The Basic Approach to Data Mining

The "Split-Apply-Combine" strategy

# DataFrame overview

julia> iris
150×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species   
     │ Float64      Float64     Float64      Float64     Cat…
─────┼─────────────────────────────────────────────────────────────
   1 │  5.1         3.5          1.4         0.2         setosa
   2 │  4.9         3.0          1.4         0.2         setosa
  ⋮ │  ⋮           ⋮           ⋮          ⋮           ⋮  
 149 │  6.2         3.4          5.4         2.3         virginica
 150 │  5.9         3.0          5.1         1.8         virginica


# For each species, compute the mean of PetalWidth and the median of PetalLength

using DataFrames, DataFramesMeta, Chain
using Statistics  # provides mean and median

@chain iris begin
    groupby(:Species)
    @combine(
        :PetalWidthMean = mean(:PetalWidth),
        :PetalLengthMedian = median(:PetalLength))
end

# >>> Result:
3×3 DataFrame
 Row │ Species     PetalWidthMean  PetalLengthMedian 
     │ Cat…        Float64         Float64
─────┼───────────────────────────────────────────────
   1 │ setosa      0.246           1.5
   2 │ versicolor  1.326           4.35
   3 │ virginica   2.026           5.55
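
To make the three steps of the strategy visible, the sketch below first performs split, apply and combine by hand, then repeats the same summary in a single `combine` call with plain DataFrames.jl `source => function => target` pairs (no macros). It is an illustration under assumptions, not the deck's own code: `toy`, `pieces`, `summary_manual` and `summary_pairs` are hypothetical names, and `toy` holds only six rows copied from the iris table.

```julia
using DataFrames, Statistics

# Hypothetical toy table: six rows copied from the iris table (two per species).
toy = DataFrame(
    Species     = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
    PetalLength = [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    PetalWidth  = [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
)

# Split: one sub-table per species.
groups = groupby(toy, :Species)

# Apply: reduce each sub-table to a one-row summary.
pieces = [
    DataFrame(
        Species           = [first(g.Species)],
        PetalWidthMean    = [mean(g.PetalWidth)],
        PetalLengthMedian = [median(g.PetalLength)],
    ) for g in groups
]

# Combine: stack the one-row summaries into a single table.
summary_manual = reduce(vcat, pieces)

# The same three steps in one call.
summary_pairs = combine(
    groups,
    :PetalWidth  => mean   => :PetalWidthMean,
    :PetalLength => median => :PetalLengthMedian,
)
```

Both tables have one row per species. Their values differ from the result above because `toy` contains two rows per species instead of fifty.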